Down-sampling speech representation in ASR

Authors

Hynek Hermansky

Pratibha Jain

Abstract

Features for automatic speech recognition (ASR) are typically sampled at about 100 Hz (10 ms analysis step). Recent experiments indicate that the most efficient components of the modulation spectrum of speech for ASR lie below about 16 Hz [1]. Consequently, RASTA processing attenuates modulation frequencies higher than 16 Hz and should in principle allow for a subsequent down-sampling of the features. It has been shown earlier that in a Gaussian mixture model based speaker recognition system (which uses a single-state HMM and therefore requires no time alignment of the incoming speech), one can down-sample the speech representation after RASTA filtering without any significant loss of performance [2]. However, since ASR uses Viterbi time alignment, the reduced number of time samples due to down-sampling, although justified by the Nyquist criterion after the low-pass filtering, could create problems. In this paper we experimentally show that down-sampling of features after RASTA filtering is feasible and could result in considerable computational or at least storage/transmission savings.

1 Temporal processing

Speech contains many sources of information, such as information about the linguistic message, about the speaker of the message, and about the communication channel used for recording and transmitting the speech signal. For a given task, it is helpful to retain the relevant sources of information in the extracted features while suppressing the irrelevant ones. In ASR, the task is to decode the linguistic message. This linguistic message is coded in the movements of the vocal tract, and the speech signal reflects these movements. The rate of change of the non-linguistic components in speech often lies outside the typical rate of change of the vocal tract shape. The RASTA [3] and LDA [4] techniques take advantage of this fact and bandpass filter the time trajectories of speech feature vectors.

1.1 Exploiting the bandpass property

RASTA filters out the fast (and slow) changes of spectral components over time. Since fast changes (high modulation frequency components) are eliminated by RASTA filtering, the Nyquist criterion suggests that the RASTA-filtered features could be sampled at a lower rate than the original non-filtered features (Fig. 2).
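The Nyquist argument can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the function names `max_downsampling_factor` and `downsample` are hypothetical, and the code assumes an idealized filter that removes everything above the stated cutoff, which a real RASTA filter only approximates. Under that assumption, it finds the largest integer factor by which a 100 Hz feature stream limited to 16 Hz could be decimated.

```python
import math


def max_downsampling_factor(frame_rate_hz: float, cutoff_hz: float) -> int:
    """Largest integer k such that frame_rate_hz / k stays strictly above
    the Nyquist rate 2 * cutoff_hz (idealized band-limit assumed)."""
    if frame_rate_hz <= 0 or cutoff_hz <= 0:
        raise ValueError("rates must be positive")
    # Strict inequality: frame_rate / k > 2 * cutoff  <=>  k < frame_rate / (2 * cutoff)
    k = math.ceil(frame_rate_hz / (2.0 * cutoff_hz)) - 1
    return max(k, 1)  # never go below "no down-sampling"


def downsample(frames: list, k: int) -> list:
    """Keep every k-th feature frame (frames are assumed already low-pass filtered)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return frames[::k]


if __name__ == "__main__":
    frame_rate = 100.0  # Hz, i.e. a 10 ms analysis step
    cutoff = 16.0       # Hz, highest modulation frequency retained
    k = max_downsampling_factor(frame_rate, cutoff)
    one_second = list(range(100))  # stand-in for 100 filtered feature frames
    reduced = downsample(one_second, k)
    print(k, frame_rate / k, len(reduced))
```

With these numbers the factor is 3, giving a frame rate of about 33.3 Hz, just above the 32 Hz Nyquist rate. Whether Viterbi alignment tolerates that reduction is the empirical question the paper addresses; the calculation only bounds what sampling theory permits.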


Similar resources

Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features

Speech enhancement is an important front-end technique to improve automatic speech recognition (ASR) in noisy environments. However, the wrong noise suppression of speech enhancement often causes additional distortions in speech signals, which degrades the ASR performance. To compensate the distortions, ASR needs to consider the uncertainty of enhanced features, which can be achieved by using t...


Informing multisource decoding in robust automatic speech recognition

Listeners are remarkably adept at recognising speech in natural multisource environments, while most Automatic Speech Recognition (ASR) technology fails in these conditions. It has been proposed that this human ability is governed by Auditory Scene Analysis (ASA) processes, in which a sound mixture is segregated into perceptual packages, called ‘streams’, by a combination of bottom-up and top-d...


Chapter 8: Acoustic Features and Distance Measure to Reduce Vulnerability of ASR Performance Due to the Presence of a Communication Channel and/or Background Noise

Saying that late 20th century automatic speech recognition (ASR) is pattern recognition, is something of a truism, but perhaps one of which the fundamental implications are not always fully appreciated. Essentially, a pattern recognition task boils down to measuring the distance between a physical representation of a new, as yet unknown token, and all elements of a set of pre-existing patterns,...


Evidence against Frame-based Analysis Techniques

The need of ∆, ∆∆, ∆∆∆, ∆∆∆∆.... measures is a clear sign of the loss in the representation capability of classical frame-based analysis techniques. Mainly coarticulation effects in fluent speech are hidden and obscured by the classical short-time analysis technique. In fact, almost every acceptable ASR system is forced to introduce this kind of post-processing technique, in order to obviate to...


Speech Representation Learning Using Unsupervised Data-Driven Modulation Filtering for Robust ASR

The performance of an automatic speech recognition (ASR) system degrades severely in noisy and reverberant environments in part due to the lack of robustness in the underlying representations used in the ASR system. On the other hand, the auditory processing studies have shown the importance of modulation filtered spectrogram representations in robust human speech recognition. Inspired by these...



Publication date: 1999
